Question No. 1 (10 marks)

For each of the following, please circle the letter introducing the best answer; each
one is worth one mark:

1 . What is an example of a null hypothesis? 

a) that a newly created model does not provide better predictions than the currently
existing model

b) that a newly created model provides a prediction of a null sample mean

c) that a newly created model provides a prediction of a null population mean 

d) that a newly created model provides a prediction that will be well fit to the null
distribution

Answer a

2. In which phase of the data analytics lifecycle do Data Scientists spend the most time
in a project?

a) Discovery 

b) Data Preparation 

c) Model Building 

d) Communicate Results 

Answer b

3. You have used k-means clustering to classify behavior of 100,000 customers for a
retail store. You decide to use household income, age, gender and yearly purchase
amount as measures. You have chosen to use 8 clusters and notice that 2 clusters only
have 3 customers assigned. What should you do?

a) Increase the number of clusters

b) Decrease the number of measures used 

c) Decrease the number of clusters 

d) Identify additional measures to add to the analysis 

Answer c

4. In which lifecycle stage are initial hypotheses formed?

a) Discovery

b) Model planning

c) Model building

d) Data preparation

Answer a

5. Consider the example of an analysis for fraud detection on credit card usage. You will
need to ensure higher risk transactions that may indicate fraudulent credit card activity
are retained in your data for analysis, and not dropped as outliers during pre-processing.
What will be your approach for loading data into the analytical sandbox
for this analysis?

a) ELT 

b) ETL 

c) EDW 

d) OLTP 
Answer a 




6. Which key role for a successful analytic project can consult and advise the project
team on the value of end results and how these will be used on a day-to-day basis?

a) Project Manager

b) Business User

c) Data Scientist 

d) Business Intelligence Analyst 
Answer b 

7. A disk drive manufacturer has a defect rate of less than 2% with 98% confidence. A
quality assurance team samples 1000 disk drives and finds 14 defective units. Which
action should the team recommend?

a) The manufacturing process should be inspected for problems.

b) A larger sample size should be taken to determine if the plant is functioning
properly.

c) A smaller sample size should be taken to determine if the plant is functioning
properly.

d) The manufacturing process is functioning properly and no further action is
required.

Answer d
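The arithmetic behind question 7 can be checked in a few lines. This is only a sketch of the comparison between the observed sample rate and the claimed 2% ceiling; the variable names are illustrative, and no formal hypothesis test is attempted here.

```python
# Figures taken from question 7; variable names are illustrative only.
sample_size = 1000
defective = 14
claimed_max_rate = 0.02  # manufacturer's claim: defect rate below 2%

# Observed defect rate in the quality assurance sample.
observed_rate = defective / sample_size
within_claim = observed_rate < claimed_max_rate

print(f"observed defect rate: {observed_rate:.1%}")
print("within claimed rate" if within_claim else "exceeds claimed rate")
```

The observed rate sits below the claimed ceiling, which is the reasoning behind concluding that no further action is required.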

8. Which characteristic applies only to Business Intelligence as opposed to Data
Science?

a) Supports solving "what if" scenarios 

b) Uses large data sets 

c) Uses only structured data 

d) Uses predictive modeling techniques 
Answer c

9. When would you use a Wilcoxon Rank Sum test?

a) When you cannot make an assumption about the distribution of the populations 

b) When the data can easily be sorted 

c) When the populations represent the sums of other values 

d) When the data cannot easily be sorted 

Answer a

10. Which activity might be performed in the Operationalize phase of the Data Analytics 
Lifecycle? 

a) Try different analytical techniques 

b) Try different variables 

c) Transform existing variables

d) Run a pilot 

Answer d 
Question No. 2 

A 

1. What are the characteristics of Big Data?

The characteristics of Big Data are: 

1) Volume (size)

2) Velocity (rapidly streaming)

3) Variety (many forms)

4) Veracity (uncertainty of data)

5) Value of data (useful data versus useless data)
2. Use the k-means algorithm and Euclidean distance to cluster the following eight
examples into 3 clusters: A1 = (2,10), A2 = (2,5), A3 = (8,4), A4 = (5,8), A5 = (7,5),
A6 = (6,4), A7 = (1,2), A8 = (4,9). The Euclidean distance is estimated using the
equation:

d((x1, y1), (x2, y2)) = sqrt((x2-x1)^2 + (y2-y1)^2)

Suppose that the initial centers of each cluster are A1, A4 and A7. Run the k-means
algorithm for 1 epoch (iteration) only. At the end of this epoch show:
a) The new clusters (i.e. the examples belonging to each cluster)
b) The centres of the new clusters
c) Draw a 10 by 10 space with all the 8 points and show the clusters after the
first epoch and the new centroids.
d) How many more iterations are needed to converge? Draw the result for each
epoch. (J marks)

Solution: 

a)

d(a,b) denotes the Euclidean distance between a and b. It is obtained directly from the distance matrix or
calculated as follows: d(a,b) = sqrt((xb-xa)^2 + (yb-ya)^2)
seed1 = A1 = (2,10), seed2 = A4 = (5,8), seed3 = A7 = (1,2)

epoch 1 - start:

A1:
d(A1, seed1) = 0 as A1 is seed1
d(A1, seed2) = sqrt(13) > 0
d(A1, seed3) = sqrt(65) > 0
-> A1 ∈ cluster1

A2:
d(A2, seed1) = sqrt(25) = 5
d(A2, seed2) = sqrt(18) = 4.24
d(A2, seed3) = sqrt(10) = 3.16 <- smaller
-> A2 ∈ cluster3

A3:
d(A3, seed1) = sqrt(72) = 8.49
d(A3, seed2) = sqrt(25) = 5 <- smaller
d(A3, seed3) = sqrt(53) = 7.28
-> A3 ∈ cluster2

A4:
d(A4, seed1) = sqrt(13)
d(A4, seed2) = 0 as A4 is seed2
d(A4, seed3) = sqrt(52) > 0
-> A4 ∈ cluster2

A5:
d(A5, seed1) = sqrt(50) = 7.07
d(A5, seed2) = sqrt(13) = 3.61 <- smaller
d(A5, seed3) = sqrt(45) = 6.71
-> A5 ∈ cluster2

A6:
d(A6, seed1) = sqrt(52) = 7.21
d(A6, seed2) = sqrt(17) = 4.12 <- smaller
d(A6, seed3) = sqrt(29) = 5.39
-> A6 ∈ cluster2

A7:
d(A7, seed1) = sqrt(65) > 0
d(A7, seed2) = sqrt(52) > 0
d(A7, seed3) = 0 as A7 is seed3
-> A7 ∈ cluster3

A8:
d(A8, seed1) = sqrt(5)
d(A8, seed2) = sqrt(2) <- smaller
d(A8, seed3) = sqrt(58)
-> A8 ∈ cluster2

end of epoch 1

new clusters: 1: {A1}, 2: {A3, A4, A5, A6, A8}, 3: {A2, A7}
b) centres of the new clusters:

C1 = (2, 10), C2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6), C3 = ((2+1)/2, (5+2)/2) = (1.5, 3.5)
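The first epoch can be reproduced with a short script. The following is a minimal sketch, assuming the eight points and the three seeds stated in the question; the names `points`, `seeds` and `euclidean` are illustrative and not part of the exam.

```python
import math

# The eight examples from the question, keyed by name.
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
# Initial centres: seed1 = A1, seed2 = A4, seed3 = A7.
seeds = [points["A1"], points["A4"], points["A7"]]


def euclidean(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


# Assignment step: each point joins the cluster of its nearest seed.
clusters = {1: [], 2: [], 3: []}
for name, p in points.items():
    dists = [euclidean(p, s) for s in seeds]
    nearest = dists.index(min(dists)) + 1
    clusters[nearest].append(name)
    print(name, [round(d, 2) for d in dists], "-> cluster", nearest)

# Update step: each new centre is the mean of its members' coordinates.
centres = {
    k: (
        sum(points[n][0] for n in members) / len(members),
        sum(points[n][1] for n in members) / len(members),
    )
    for k, members in clusters.items()
}
print(clusters)
print(centres)
```

Printing all three distances per point makes it easy to compare each line against the hand calculation.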



d)

We would need two more epochs. After the 2nd epoch the results would be:

1: {A1, A8}, 2: {A3, A4, A5, A6}, 3: {A2, A7}

with centers C1=(3, 9.5), C2=(6.5, 5.25) and C3=(1.5, 3.5).

After the 3rd epoch the results would be:

1: {A1, A4, A8}, 2: {A3, A5, A6}, 3: {A2, A7}

with centers C1=(3.67, 9), C2=(7, 4.33) and C3=(1.5, 3.5).
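The number of epochs needed can also be checked by iterating until the cluster memberships stop changing. This is a sketch under the same assumptions as the question (the eight points, seeds A1, A4 and A7); the helper names `assign`, `centroids` and `run_kmeans` are hypothetical, and the code assumes no cluster ever becomes empty, which holds for this data.

```python
import math

POINTS = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}


def assign(centres):
    """Return one list of point names per centre (nearest-centre rule)."""
    clusters = [[] for _ in centres]
    for name, (x, y) in POINTS.items():
        dists = [math.hypot(x - cx, y - cy) for cx, cy in centres]
        clusters[dists.index(min(dists))].append(name)
    return clusters


def centroids(clusters):
    """Mean coordinates of each cluster (assumes no cluster is empty)."""
    return [
        (
            sum(POINTS[n][0] for n in c) / len(c),
            sum(POINTS[n][1] for n in c) / len(c),
        )
        for c in clusters
    ]


def run_kmeans(initial):
    """Record (clusters, centres) for every epoch that changes memberships."""
    history, centres, previous = [], list(initial), None
    while True:
        clusters = assign(centres)
        if clusters == previous:  # memberships unchanged: converged
            return history
        centres = centroids(clusters)
        history.append((clusters, centres))
        previous = clusters


history = run_kmeans([POINTS["A1"], POINTS["A4"], POINTS["A7"]])
for epoch, (clusters, centres) in enumerate(history, start=1):
    rounded = [(round(x, 2), round(y, 2)) for x, y in centres]
    print(f"epoch {epoch}: clusters={clusters} centres={rounded}")
```

With this stopping rule, `history` holds only the epochs in which memberships changed; the final pass that merely confirms convergence is not recorded.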
1. Students were given different drug treatments before revising for their exams.
Some were given a memory drug, some a placebo drug and some no treatment.
The exam scores (%) are given below for the three different groups. Carry out
a one-way ANOVA to test the hypothesis that the treatments will have different
effects.



                 Memory Drug   Placebo   No Treatment
                 70            37        3
                 77            43        10
                 83            50        17
                 90            57        23
                 97            63        30
Mean             83.40         50.00     16.60
Variance         112.30        109.00    112.30

Grand Mean       50.00
Grand Variance   892.14
Descriptives

                                                          95% Confidence Interval for Mean
               N    Mean      Std. Deviation  Std. Error  Lower Bound  Upper Bound  Minimum  Maximum
Memory Drug    5    83.4000   10.5972         4.7392      70.2419      96.5581      70.00    97.00
Placebo        5    50.0000   10.4403         4.6690      37.0366      62.9634      37.00    63.00
No Treatment   5    16.6000   10.5972         4.7392       3.4419      29.7581       3.00    30.00
Total          15   50.0000   29.8688         7.7121      33.4592      66.5408       3.00    97.00

ANOVA

Exam Score (%)

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   11155.600        2    5577.800      50.160   .000
Within Groups    1334.400         12   111.200
Total            12490.000        14
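The sums of squares and the F ratio in the ANOVA table follow directly from the raw scores. This is a minimal sketch of the one-way ANOVA arithmetic in plain Python; the variable names are illustrative, and the significance value is not computed because the standard library has no F-distribution function.

```python
groups = {
    "Memory Drug": [70, 77, 83, 90, 97],
    "Placebo": [37, 43, 50, 57, 63],
    "No Treatment": [3, 10, 17, 23, 30],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)
group_means = {name: sum(g) / len(g) for name, g in groups.items()}

# Between-groups SS: group size times squared distance of each group mean
# from the grand mean.
ss_between = sum(
    len(g) * (group_means[name] - grand_mean) ** 2 for name, g in groups.items()
)
# Within-groups SS: squared distance of each score from its own group mean.
ss_within = sum(
    (x - group_means[name]) ** 2 for name, g in groups.items() for x in g
)
ss_total = ss_between + ss_within

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(f"Between Groups  SS={ss_between:.3f}  df={df_between}  MS={ms_between:.3f}  F={f_ratio:.3f}")
print(f"Within Groups   SS={ss_within:.3f}  df={df_within}  MS={ms_within:.3f}")
print(f"Total           SS={ss_total:.3f}  df={df_between + df_within}")
```

If SciPy is available, `scipy.stats.f_oneway` applied to the three score lists should give the same F statistic together with its p-value.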
[Figure: schema with the entities Person, Woman and Worker and the label Military Service; the remaining labels are not recoverable]


men and women: 






a) Correct the schema, taking into account the fundamental properties of the
generalization.


[Figure: schema diagram for part a); only the label Woman is legible]


b) The schema represents only the female workers; modify the schema to represent
all the workers, men and women.


[Figure: schema diagram for part b); legible labels: Worker, Woman]



Best wishes 

Dr. Sherin El Gokhy





